[SPARK-40849][SS] Async log purge#38313
Closed
jerrypeng wants to merge 5 commits intoapache:masterfrom
Closed
Conversation
Contributor
|
cc. @zsxwing @xuanyuanking @viirya Appreciate your reviews. Thanks! |
Contributor
|
I'll find a time to review in tomorrow. |
|
Can one of the admins verify this patch? |
HeartSaVioR
approved these changes
Oct 21, 2022
Contributor
HeartSaVioR
left a comment
There was a problem hiding this comment.
+1
Let me wait for a couple more days to seek more eyes of review. I'll merge this in early next week if there is no outstanding comment.
Contributor
Author
|
thanks @HeartSaVioR ! |
Contributor
|
the failed python code gen check is unrelated to this PR, please rebase to make CI green |
Contributor
|
https://github.com/jerrypeng/spark/actions/runs/3293047872/jobs/5447874146 Remaining steps are unrelated to this PR - only license check which is respected in this PR. |
Contributor
|
Thanks! Merging to master. |
SandishKumarHN
pushed a commit
to SandishKumarHN/spark
that referenced
this pull request
Dec 12, 2022
### What changes were proposed in this pull request? Purging old entries in both the offset log and commit log will be done asynchronously. For every micro-batch, older entries in both offset log and commit log are deleted. This is done so that the offset log and commit log do not continually grow. Please reference logic here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539 The time spent performing these log purges is grouped with the “walCommit” execution time in the StreamingProgressListener metrics. Around two thirds of the “walCommit” execution time is performing these purge operations thus making these operations asynchronous will also reduce latency. Also, we do not necessarily need to perform the purges every micro-batch. When these purges are executed asynchronously, they do not need to block micro-batch execution and we don’t need to start another purge until the current one is finished. The purges can happen essentially in the background. We will just have to synchronize the purges with the offset WAL commits and completion commits so that we don’t have concurrent modifications of the offset log and commit log. ### Why are the changes needed? Decrease microbatch processing latency ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes apache#38313 from jerrypeng/SPARK-40849. Authored-by: Jerry Peng <jerry.peng@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Purging old entries in both the offset log and commit log will be done asynchronously.
For every micro-batch, older entries in both offset log and commit log are deleted. This is done so that the offset log and commit log do not continually grow. Please reference logic here
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539
The time spent performing these log purges is grouped with the “walCommit” execution time in the StreamingProgressListener metrics. Around two thirds of the “walCommit” execution time is performing these purge operations thus making these operations asynchronous will also reduce latency. Also, we do not necessarily need to perform the purges every micro-batch. When these purges are executed asynchronously, they do not need to block micro-batch execution and we don’t need to start another purge until the current one is finished. The purges can happen essentially in the background. We will just have to synchronize the purges with the offset WAL commits and completion commits so that we don’t have concurrent modifications of the offset log and commit log.
Why are the changes needed?
Decrease microbatch processing latency
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests